Unsupervised Adversarial Depth Estimation using Cycled Generative Networks
While recent deep monocular depth estimation approaches based on supervised
regression have achieved remarkable performance, costly ground truth
annotations are required during training. To cope with this issue, in this
paper we present a novel unsupervised deep learning approach for predicting
depth maps and show that the depth estimation task can be effectively tackled
within an adversarial learning framework. Specifically, we propose a deep
generative network that learns to predict the correspondence field (i.e., the
disparity map) between two image views in a calibrated stereo camera setting.
The proposed architecture consists of two generative sub-networks, jointly
trained with adversarial learning to reconstruct the disparity map and
organized in a cycle so as to provide mutual constraints and supervision to
each other. Extensive experiments on the publicly available datasets KITTI and
Cityscapes demonstrate the effectiveness of the proposed model, which achieves
results competitive with state-of-the-art methods. The code and trained model
are available at https://github.com/andrea-pilzer/unsup-stereo-depthGAN. Comment: To appear in 3DV 2018. Code is available on GitHub.
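The supervision signal described above, reconstructing one stereo view from the other via a predicted disparity map and closing the loop with a cycle constraint, can be sketched in a toy 1-D form. This is a hypothetical illustration, not the authors' implementation: real models use differentiable bilinear sampling and learned CNN generators, and sign conventions for disparity vary by setup.

```python
import numpy as np

def warp_row(row, disparity):
    """Resample a 1-D image row at x + d(x) (nearest neighbour).

    Toy stand-in for the differentiable warping used in unsupervised
    stereo depth models.
    """
    w = row.shape[0]
    xs = np.clip(np.arange(w) + disparity, 0, w - 1).astype(int)
    return row[xs]

# Forward half-cycle: synthesize the right view from the left ...
left = np.arange(10.0)
disp = np.full(10, 2)            # a (here, constant) predicted disparity
right_hat = warp_row(left, disp)

# ... backward half-cycle: warping back with -d should recover the left
# view, giving a cycle-consistency loss that supervises both generators.
left_hat = warp_row(right_hat, -disp)
cycle_loss = np.abs(left - left_hat).mean()
```

Away from the clipped image border, the backward half-cycle reproduces the left view exactly, so any residual error is attributable to the predicted disparity.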
Viraliency: Pooling Local Virality
In our overly-connected world, the automatic recognition of virality - the
quality of an image or video to be rapidly and widely spread in social networks
- is of crucial importance, and has recently awakened the interest of the
computer vision community. Concurrently, recent progress in deep learning
architectures showed that global pooling strategies allow the extraction of
activation maps, which highlight the parts of the image most likely to contain
instances of a certain class. We extend this concept by introducing a pooling
layer that learns the size of the support area to be averaged: the learned
top-N average (LENA) pooling. We hypothesize that the latent concepts (feature
maps) describing virality may require such a rich pooling strategy. We assess
the effectiveness of the LENA layer by appending it on top of a convolutional
siamese architecture and evaluate its performance on the task of predicting and
localizing virality. We report experiments on two publicly available datasets
annotated for virality and show that our method outperforms state-of-the-art
approaches. Comment: Accepted at IEEE CVPR 2017
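As a hedged illustration of the pooling idea (not the paper's trainable layer — in LENA the support size is learned end-to-end, whereas here it is a fixed argument), a top-N average over a feature map can be written as:

```python
import numpy as np

def top_n_average_pool(feature_map, n):
    # Average the n largest activations in the map. n = 1 recovers global
    # max pooling; n = feature_map.size recovers global average pooling,
    # so a learned n interpolates between the two classic strategies.
    flat = np.sort(feature_map.ravel())[::-1]
    return flat[:n].mean()

fm = np.array([[0.1, 0.9],
               [0.4, 0.6]])
pooled = top_n_average_pool(fm, n=2)   # mean of the two largest values
```

The interpolation between max and average pooling is exactly what lets the layer adapt its support area to however localized or diffuse the latent virality concepts turn out to be.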
Reproducibility is Nothing without Correctness: The Importance of Testing Code in NLP
Despite its pivotal role in research experiments, code correctness is often
presumed only on the basis of the perceived quality of the results. This comes
with the risk of erroneous outcomes and potentially misleading findings. To
address this issue, we posit that the current focus on result reproducibility
should go hand in hand with the emphasis on coding best practices. We bolster
our call to the NLP community by presenting a case study, in which we identify
(and correct) three bugs in widely used open-source implementations of the
state-of-the-art Conformer architecture. Through comparative experiments on
automatic speech recognition and translation in various language settings, we
demonstrate that the presence of bugs does not prevent the achievement of good
and reproducible results, yet it can lead to incorrect conclusions that
potentially misguide future research. This study is therefore a call to action
toward the adoption of coding best practices aimed at fostering correctness and
improving the quality of the developed software.
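The kind of lightweight unit test this abstract advocates can be sketched as follows. This is a generic example, not one of the three Conformer bugs from the case study: even a numerically sensitive building block like softmax admits cheap property-based checks that catch errors no end-to-end metric would reveal.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: shifting by the max leaves the result
    # mathematically unchanged but prevents overflow in exp().
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def test_rows_sum_to_one():
    x = np.random.default_rng(0).normal(size=(4, 7))
    assert np.allclose(softmax(x).sum(axis=-1), 1.0)

def test_shift_invariance():
    x = np.array([1.0, 2.0, 3.0])
    assert np.allclose(softmax(x), softmax(x + 1000.0))  # no overflow
```

A buggy but non-crashing variant (e.g. normalizing over the wrong axis) would still produce plausible-looking scores downstream, which is precisely why result quality alone cannot certify correctness.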
Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training
The use of self-supervised pre-training has emerged as a promising approach
to enhance the performance of visual tasks such as image classification. In
this context, recent approaches have employed the Masked Image Modeling
paradigm, which pre-trains a backbone by reconstructing visual tokens
associated with randomly masked image patches. This masking approach, however,
introduces noise into the input data during pre-training, leading to
discrepancies that can impair performance during the fine-tuning phase.
Furthermore, input masking neglects the dependencies between corrupted patches,
increasing the inconsistencies observed in downstream fine-tuning tasks. To
overcome these issues, we propose a new self-supervised pre-training approach,
named Masked and Permuted Vision Transformer (MaPeT), that employs
autoregressive and permuted predictions to capture intra-patch dependencies. In
addition, MaPeT employs auxiliary positional information to reduce the
disparity between the pre-training and fine-tuning phases. In our experiments,
we employ a fair setting to ensure reliable and meaningful comparisons and
conduct investigations on multiple visual tokenizers, including our proposed
κ-CLIP, which directly employs discretized CLIP features. Our results
demonstrate that MaPeT achieves competitive performance on ImageNet, compared
to baselines and competitors under the same model setting. Source code and
trained models are publicly available at: https://github.com/aimagelab/MaPeT
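The permuted-prediction objective can be made concrete with a small sketch. This is an illustrative reconstruction assuming an XLNet-style factorization order, not the authors' exact implementation: each patch may attend only to patches that precede it in a randomly sampled order, so every patch is eventually predicted from a different, dependency-aware context.

```python
import numpy as np

def permuted_visibility_mask(num_patches, rng):
    # Sample a factorization order over patches; entry (i, j) is True iff
    # patch j comes earlier in the order, i.e. j is visible when
    # predicting patch i autoregressively.
    order = rng.permutation(num_patches)
    rank = np.empty(num_patches, dtype=int)
    rank[order] = np.arange(num_patches)   # rank[p] = position of patch p
    return rank[None, :] < rank[:, None]

mask = permuted_visibility_mask(6, np.random.default_rng(0))
```

Unlike random input masking, no patch is ever replaced by a noise token: the corruption lives entirely in the attention mask, which is one way to reduce the pre-training/fine-tuning discrepancy the abstract describes.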
A Look at Improving Robustness in Visual-inertial SLAM by Moment Matching
The fusion of camera sensor and inertial data is a leading method for
ego-motion tracking in autonomous and smart devices. State estimation
techniques that rely on non-linear filtering are a strong paradigm for solving
the associated information fusion task. The de facto inference method in this
space is the celebrated extended Kalman filter (EKF), which relies on
first-order linearizations of both the dynamical and measurement model. This
paper takes a critical look at the practical implications and limitations posed
by the EKF, especially under faulty visual feature associations and the
presence of strong confounding noise. As an alternative, we revisit the assumed
density formulation of Bayesian filtering and employ a moment matching
(unscented Kalman filtering) approach to both visual-inertial odometry and
visual SLAM. Our results highlight important aspects in robustness both in
dynamics propagation and visual measurement updates, and we show
state-of-the-art results on the EuRoC MAV drone data benchmark. Comment: 8 pages, to appear in Proceedings of FUSION 2022
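The moment-matching alternative to the EKF's first-order linearization can be sketched with a minimal unscented transform. This is a self-contained version with a single scaling parameter `kappa`; production filters like those in the paper use the full UKF with separate mean and covariance weights.

```python
import numpy as np

def unscented_transform(mean, cov, f, kappa=0.0):
    # Propagate 2n+1 deterministically chosen sigma points through the
    # (possibly nonlinear) function f, then re-estimate mean and
    # covariance from the transformed points -- no Jacobians required.
    n = mean.size
    root = np.linalg.cholesky((n + kappa) * cov)
    sigma = np.vstack([mean, mean + root.T, mean - root.T])
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    y = np.array([f(p) for p in sigma])
    y_mean = w @ y
    d = y - y_mean
    return y_mean, (w[:, None] * d).T @ d
```

For a linear f the transform is exact and coincides with the Kalman prediction; its advantage over the EKF appears when f is strongly nonlinear, such as the camera projection in the visual measurement update, or when outlier-corrupted feature associations make local linearization misleading.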
Progressive Fusion for Unsupervised Binocular Depth Estimation using Cycled Networks
Accepted to TPAMI (SI RGB-D Vision); code available at https://github.com/andrea-pilzer/PFN-depth. Recent deep monocular depth estimation approaches based on supervised regression have achieved remarkable performance. However, they require costly ground truth annotations during training. To cope with this issue, in this paper we present a novel unsupervised deep learning approach for predicting depth maps. We introduce a new network architecture, named Progressive Fusion Network (PFN), that is specifically designed for binocular stereo depth estimation. This network is based on a multi-scale refinement strategy that combines the information provided by both stereo views. In addition, we propose to stack this network twice in order to form a cycle. This cycle approach can be interpreted as a form of data augmentation since, at training time, the network learns both from the training set images (in the forward half-cycle) and from the synthesized images (in the backward half-cycle). The architecture is jointly trained with adversarial learning. Extensive experiments on the publicly available datasets KITTI, Cityscapes and ApolloScape demonstrate the effectiveness of the proposed model, which is competitive with other unsupervised deep learning methods for depth prediction.
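One step of a coarse-to-fine multi-scale strategy of this kind can be sketched as follows. This is purely hypothetical and illustrative: PFN's actual refinement modules are learned CNN blocks that fuse information from both stereo views, whereas here the upsampling is plain nearest-neighbour and the residual is an opaque input.

```python
import numpy as np

def refine_step(coarse_disp, residual):
    # Upsample the coarse disparity 2x (nearest neighbour) and correct it
    # with a fine-scale residual: one rung of a coarse-to-fine pyramid.
    up = coarse_disp.repeat(2, axis=0).repeat(2, axis=1)
    return up + residual

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
fine = refine_step(coarse, residual=np.zeros((4, 4)))
```

Starting from a low-resolution estimate keeps the matching problem easy, and each refinement rung only has to predict a small residual rather than the full disparity at the fine scale.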